{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 11 - Iteration and Sampling distributions\n", "\n", "For this lab, we will use a list of the top 1000 movies on [IMDb](https://www.imdb.com) compiled by [Kevin Markham](https://www.dataschool.io/about/) several years ago. We also used this dataset in Lab 9.\n", "\n", "The data CSV file is on GitHub here: [imdb_1000.csv](https://github.com/justmarkham/DAT8/blob/master/data/imdb_1000.csv). To download it, right-click on Raw and save the CSV file.\n", "\n", "## Iteration (Loops)\n", "First we will look at iteration, or loops, a way to repeat a section of code multiple times without retyping it. Try running the code below. What does it do?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for i in range(10):\n", "    print(\"Hello!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modify the code below to print `Good-bye!` 5 times:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for i in range(10):\n", "    print(\"Hello!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think the following code does? Make a guess and then run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for i in range(4):\n", "    print(\"Lehman\")\n", "    print(\"College\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think the following code does? Make a guess and then run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print(\"Starting...\")\n", "for i in range(3):\n", "    print(\"In the loop\")\n", "print(\"Ending...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only the code that is indented is repeated. 
Another name for iteration is a *loop*.\n", "\n", "Can you write code that uses a loop to display the following?\n", "\n", "    This is my\n", "    loop\n", "    loop\n", "    loop\n", "    loop\n", "    loop\n", "    !" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you write code that uses a loop to display the following?\n", "\n", "    This is a\n", "    fancier\n", "    loop\n", "    fancier\n", "    loop\n", "    fancier\n", "    loop\n", "    isn't\n", "    it?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling and empirical distributions of statistics\n", "\n", "A *sampling distribution* of a statistic (mean, median, variance, etc.) is the distribution of that statistic over all possible samples of the same size. Since it's impractical to compute all possible samples, we will instead take many random samples and compute the statistic for each one. The distribution of those values is the *empirical distribution* of the statistic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, we will import the matplotlib and pandas packages, and set plots to appear in the Jupyter notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the CSV file into a dataframe called `movies`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the dataframe was created properly by displaying it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the columns. Which columns contain quantitative data? These are the only columns we can compute the mean of.\n", "\n", "We are going to take 10 random samples of size 50 from our dataframe `movies`, compute the mean star rating of each sample, and plot a histogram of these means. First we create an empty list to store the means. Type `means = []` below and run the code." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, can you write the code to take a sample of size 50 from `movies` and compute the mean of the `star_rating` column in the sample? Look back at Lab 7 if necessary." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    sample = movies.sample(50)\n", "    sample[\"star_rating\"].mean()\n", "
\n", "\n", "Great! Now try putting your code inside of a loop, so that it repeats 10 times." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    for i in range(10):\n", "        sample = movies.sample(50)\n", "        sample[\"star_rating\"].mean()\n", "
\n", "\n", "Inside a loop, the value of the last line is not displayed automatically. To make the code print the means, we have to change the last line of code to use the `print()` function, like this: `print(sample[\"star_rating\"].mean())`. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    for i in range(10):\n", "        sample = movies.sample(50)\n", "        print(sample[\"star_rating\"].mean())\n", "
\n", "\n", "Now, instead of printing the means, we want to save them to our list `means`. We do this with the list's `append()` method: `means.append(sample[\"star_rating\"].mean())`\n", "\n", "Copy your loop below and change it to add the means to the list instead of printing them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    for i in range(10):\n", "        sample = movies.sample(50)\n", "        means.append(sample[\"star_rating\"].mean())\n", "
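The same collect-in-a-list pattern can be tried without the `movies` dataframe. The sketch below is only an illustration: it uses Python's standard `random` and `statistics` modules instead of pandas, the `ratings` values are made up rather than taken from the dataset, and the names `ratings` and `toy_means` are hypothetical. The fixed seed is also an assumption, added so the run is repeatable; the lab itself does not set one.

```python
import random
import statistics

random.seed(0)  # assumption: fixed seed so repeated runs give the same samples

# Hypothetical ratings standing in for the star_rating column (not real data)
ratings = [7.4, 7.9, 8.1, 8.3, 7.6, 8.0, 8.8, 7.5, 9.2, 7.7]

toy_means = []  # start with an empty list, as with means = []
for i in range(10):
    sample = random.sample(ratings, 5)         # draw 5 values without replacement
    toy_means.append(statistics.mean(sample))  # store this sample's mean in the list

print(len(toy_means))  # one stored mean per pass through the loop: prints 10
```

Each pass through the loop adds exactly one value to the list, so the list ends up with as many means as the loop has repetitions.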
\n", "\n", "You can see the means by typing `means` below, which will display the contents of the list." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make a histogram of these means, we have to first convert the list into a Pandas Series, and then we can make the histogram. Type `pd.Series(means).hist()` below and run the code." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try this again, but with 200 samples so that we get a better histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "    means200 = []\n", "    for i in range(200):\n", "        sample = movies.sample(50)\n", "        means200.append(sample[\"star_rating\"].mean())\n", "    pd.Series(means200).hist()\n", "
\n", "\n", "What do you notice?\n", "\n", "Now let's take 100 samples of size 50 and take the variance of each sample instead of the mean. What does the histogram (empirical distribution) look like? Does it have the same shape as for the mean?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Challenges:\n", "- What happens if you leave the number of samples the same, but increase the size of the samples?\n", "- What does the sampling distribution of the median star rating look like?\n", "- What do the sampling distributions of the mean, median, and variance of the duration look like? How does this compare to sampling distributions of the mean, median, and variance of the star ratings?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }